Abstract

As an entry for the 1999 Gordon Bell price/performance prize, we report an astrophysical
N-body simulation performed with a treecode on the GRAPE-5 (GRAvity PipE 5) system,
a special-purpose computer for astrophysical N-body simulations. The GRAPE-5 system
has 32 pipeline processors specialized for the gravitational force calculation. Other
operations, such as tree construction, tree traversal and time integration, are performed on a
general-purpose workstation. The total cost of the GRAPE-5 system is 40,900 dollars. We
performed a cosmological N-body simulation with 2.1 million particles, which sustained a
performance of 5.92 Gflops averaged over 8.37 hours. The resulting price/performance
is 7.0 dollars per Mflops.

1 Introduction

Astrophysical N-body simulation is one of the most widely used techniques for investigating
the formation and evolution of astronomical objects, such as galaxies, galaxy clusters and
large-scale structures of the universe. In such simulations, we calculate the gravitational force on
each particle from all other particles, and integrate the orbit of each particle according to
Newton's equation of motion. We then investigate the structural and dynamical properties of the
simulated object.

Astrophysical N-body simulation has been one of the grand challenge problems in
computational science. In 1992, 1996, 1997 and 1998, Gordon Bell prizes were awarded
to cosmological N-body simulations, and in 1995 the prize was awarded
to an N-body simulation of a black hole binary in a galaxy. The calculation cost of an
astrophysical N-body simulation increases rapidly for large N, because it is proportional
to N^2 if we use a straightforward approach. Gravity is a long-range attractive force:
a particle feels the forces from all other particles, no matter how far away they are. We
therefore cannot use a cutoff technique of the kind widely used in molecular dynamics (MD)
simulations. In order to reduce the calculation cost, various fast algorithms have been developed.
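To make the N^2 scaling of the straightforward approach concrete, the following is a minimal sketch of direct pairwise summation in Python with NumPy. It is an illustration only, not the code used in this work; the function name `direct_forces`, the softening length `eps` and the unit choice G = 1 are assumptions made for the example.

```python
import numpy as np

def direct_forces(pos, mass, eps=1e-3, G=1.0):
    """Accelerations by direct summation over all N*(N-1) ordered pairs.

    pos  : (N, 3) array of positions
    mass : (N,) array of masses
    eps  : softening length (assumed here to avoid a singularity at r = 0)
    """
    # dx[i, j] is the vector from particle i to particle j.
    dx = pos[None, :, :] - pos[:, None, :]
    r2 = np.sum(dx * dx, axis=-1) + eps * eps   # softened squared distance
    inv_r3 = r2 ** -1.5
    np.fill_diagonal(inv_r3, 0.0)               # a particle exerts no force on itself
    # acc[i] = G * sum_j m_j * dx[i, j] / r_ij^3
    return G * np.einsum("ij,ijk->ik", mass[None, :] * inv_r3, dx)

# Two unit masses separated by unit distance attract each other equally.
pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
mass = np.array([1.0, 1.0])
acc = direct_forces(pos, mass)
```

The intermediate arrays have N x N entries, so both memory and work grow as N^2; this is the cost that the fast algorithms aim to avoid.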

The hierarchical tree algorithm is one such fast algorithm; it reduces the calculation
cost from O(N^2) to O(N log N). In this algorithm, particles are organized in the form of a
tree, and each node of the tree represents a group of particles. The force from a distant
node is replaced by the force from its center of mass. The Gordon Bell prizes of 1992,
1997 and 1998 were awarded to N-body simulations with this tree algorithm, which
were performed on the Intel Touchstone Delta, ASCI-Red, a PC cluster, and an Alpha cluster.
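The central approximation of the tree algorithm, replacing a distant group of particles by a single mass at its center of mass, can be checked numerically. The sketch below compares the exact summed acceleration from a small, distant cluster with the center-of-mass value. The cluster size, distance and the helper names `exact_acc` and `monopole_acc` are assumptions chosen for illustration (G = 1), not parameters of the simulation reported here.

```python
import numpy as np

def exact_acc(target, pos, mass):
    """Acceleration at `target` summed over every particle in the group."""
    d = pos - target
    r3 = np.sum(d * d, axis=1) ** 1.5
    return np.sum(mass[:, None] * d / r3[:, None], axis=0)

def monopole_acc(target, pos, mass):
    """Acceleration at `target` from the group's total mass at its center of mass."""
    m_total = mass.sum()
    com = (mass[:, None] * pos).sum(axis=0) / m_total
    d = com - target
    return m_total * d / np.dot(d, d) ** 1.5

rng = np.random.default_rng(0)
# 100 unit-mass particles in a unit cube centered 20 length units from the target.
cluster = rng.uniform(-0.5, 0.5, size=(100, 3)) + np.array([20.0, 0.0, 0.0])
mass = np.ones(100)
target = np.zeros(3)

a_exact = exact_acc(target, cluster, mass)
a_mono = monopole_acc(target, cluster, mass)
rel_err = np.linalg.norm(a_mono - a_exact) / np.linalg.norm(a_exact)
```

The relative error shrinks as the group becomes small compared with its distance, which is why a tree code opens nearby nodes and accepts distant ones as single bodies.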

We report an astrophysical N-body simulation with the tree algorithm on the GRAPE-5
(GRAvity PipE) special-purpose computer. GRAPE-5 has dedicated pipelines specialized
for the calculation of the gravitational force. It is connected to a host computer, which is a
general-purpose workstation, and operates as a hardware accelerator for the calculation of
the gravitational force. Other operations, such as tree construction, tree traversal and time
integration, are performed on the host computer. It has already been demonstrated that
the approach using special-purpose machines can achieve very high performance
in scientific computations: by the Gordon Bell prize simulations of 1995 and 1996, which
were performed on GRAPE-4, and by the most recent Gordon Bell prize simulation, which was
performed on QCDSP.

We performed a cosmological simulation with 2.1 million particles using the tree algorithm
on GRAPE-5 connected to a COMPAQ AlphaServer DS10. The sustained performance is 5.92
Gflops and the price/performance is $7.0/Mflops. In the rest of this paper, we describe the
GRAPE-5 system and the tree algorithm on GRAPE, and report the cost and performance.



2 GRAPE-5 system 

We briefly describe the architecture of the GRAPE-5 system. More detailed descriptions of
the GRAPE-5 system will be given elsewhere. GRAPE-5 is designed to run the tree
code at very high speed. Figure 1 summarizes the configuration of the GRAPE-5 system
used for the simulation reported in this paper. The GRAPE-5 system consists of 2 processor
boards, 2 host interface boards, and a host computer. The processor boards perform
the force calculation. The host interface boards handle the communication between the
processor boards and the host computer. The host computer performs all other operations.
We used a COMPAQ AlphaServer DS10 with a 21264/466MHz Alpha processor as the host
computer. Figures 2 and 3 are photographs of the GRAPE-5 system and a GRAPE-5
processor board, respectively.

Each processor board consists of 8 processor chips (G5 chips) and a particle data memory.
The G5 chip is a custom LSI chip which calculates the gravitational force. Each G5 chip houses
2 pipelines specialized for the force calculation. The particle data memory stores the data
of the particles which exert the force and supplies them to the G5 chips. The G5 chips operate at
90MHz and the other parts of the processor board operate at 15MHz.

Figure 1: Block diagram of the GRAPE-5 system

Figure 2: Photograph of the GRAPE-5 system

Figure 3: Photograph of the GRAPE-5 processor board

The G5 chip is designed for astrophysical N-body simulations with the tree algorithm and
calculates a pair-wise force with a relative error of about 0.3%. This accuracy might sound rather
low, but detailed theoretical analysis [12] and numerical experiments have shown that it
is more than enough. The average error of the force in our simulation is around 0.1%, which
is dominated by the approximation made in the tree algorithm and not by the accuracy
of the hardware. The relative accuracy was practically the same when we performed the
same force calculation using standard 64-bit floating point arithmetic.

The theoretical peak speed of the GRAPE-5 system is 109.44 Gflops. The total number of
pipeline processors is 32. Each pipeline performs 38 operations per clock cycle,
if we use the same counting convention as in previous work.
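The peak figure follows directly from the hardware parameters given in this section. The short check below reproduces the arithmetic; it assumes, as the text implies, that each pipeline completes one pairwise interaction (counted as 38 operations) per 90MHz clock cycle. The variable names are chosen for this example.

```python
boards = 2                 # processor boards in the system
chips_per_board = 8        # G5 chips per board
pipelines_per_chip = 2     # force pipelines per G5 chip
ops_per_interaction = 38   # operation count per pairwise interaction
clock_hz = 90e6            # G5 chip clock frequency

pipelines = boards * chips_per_board * pipelines_per_chip
peak_gflops = pipelines * ops_per_interaction * clock_hz / 1e9
print(pipelines, peak_gflops)
```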



3 Tree algorithm 

Our code is based on Barnes's modified tree algorithm [15]. The implementation
of the modified tree algorithm on GRAPE was discussed in [16]. Using this algorithm, the
calculation cost on the host computer is greatly reduced from that of the original algorithm,
and the forces exerted on multiple particles can be calculated in parallel. In the original
algorithm, an interaction list is created for each particle. In the modified tree algorithm,
neighboring particles are grouped and one interaction list is shared among the particles in
the same group. Forces from particles in the same group are calculated directly.

The modified tree algorithm reduces the calculation cost of the host computer by roughly
a factor of n_g, where n_g is the average number of particles in a group. On the other hand,
the amount of work on GRAPE-5 increases as we increase n_g, since the interaction list
becomes longer. There is, therefore, an optimal n_g at which the total computing time is
minimum. The optimal n_g strongly depends on the ratio of the speeds of the host computer
and GRAPE. For the present configuration, the optimal n_g is around 2000.
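The existence of an optimal n_g can be illustrated with a deliberately simple toy model. The sketch below assumes that the host time falls as 1/n_g and that the GRAPE time grows linearly with n_g; both functional forms, the function names, and the cost coefficients are assumptions made for illustration and are not measurements or fits from this work.

```python
import math

def total_time(n_g, host_cost, grape_cost):
    """Toy model: host work scales as 1/n_g, GRAPE work scales as n_g."""
    return host_cost / n_g + grape_cost * n_g

def optimal_group_size(host_cost, grape_cost):
    """Minimum of the toy model, found by setting dT/dn_g = 0."""
    return math.sqrt(host_cost / grape_cost)

# Arbitrary illustrative coefficients (not taken from the paper).
n_opt = optimal_group_size(host_cost=1.0e4, grape_cost=1.0)

# A host that is four times faster shifts the optimum to smaller groups.
n_opt_fast_host = optimal_group_size(host_cost=2.5e3, grape_cost=1.0)
```

In this model the optimum depends only on the ratio of the two cost coefficients, which mirrors the statement that the optimal n_g is set by the relative speeds of the host computer and GRAPE.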

Note that our modified tree algorithm performs a larger number of operations than the
tree algorithm on a general-purpose computer. When we estimate the performance in
section 5, we correct for this difference. Note also that our modified tree algorithm is more
accurate than the original tree algorithm for the same accuracy parameter.



4 Cost 

The total cost of the GRAPE-5 system is 4.7 M JYE (million Japanese yen). The GRAPE-5 board
is available from a Japanese commercial company for the price of 1.65 M JYE per board. The
remaining 1.4 M JYE was spent on the host computer, a COMPAQ AlphaServer DS10, including 512
MByte of main memory and a C++ compiler. The total cost, at the present exchange rate
of 1 dollar = 115 JYE, is about 40,900 dollars.
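The cost breakdown can be verified with a few lines of arithmetic. The check below uses only the prices and the exchange rate quoted above, and assumes the two processor boards described in section 2; the variable names are chosen for this example.

```python
board_price_mjye = 1.65   # price per GRAPE-5 board, in millions of yen
n_boards = 2              # processor boards in the system
host_price_mjye = 1.4     # host computer with memory and compiler
jye_per_dollar = 115      # exchange rate used in the text

total_mjye = n_boards * board_price_mjye + host_price_mjye
total_dollars = total_mjye * 1e6 / jye_per_dollar
print(total_mjye, round(total_dollars, -2))
```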



5 Simulation 

We report the performance statistics for the astrophysical N-body simulation with the
tree algorithm on GRAPE-5. The performance numbers are based on the wall-clock time
obtained from the UNIX system timer on the host computer (COMPAQ AlphaServer DS10).

We performed a cosmological N-body simulation of a sphere of radius 50Mpc (mega
parsec) with 2,159,038 particles for 999 timesteps. We assigned the initial positions and
velocities to particles in a spherical region selected from a discrete realization of the density
contrast field based on a standard cold dark matter scenario, using the COSMICS package.
A particle represents 1.7 x 10^10 solar masses. We performed the simulation from z = 24,
where z is the redshift, to the present time. Figure 4 shows a snapshot of the simulation.

The total number of particle-particle interactions is 2.90 x 10^13. This implies that the
average length of the interaction list is 13,431. The whole simulation took 30,141 seconds
(8.37 hours) including I/O, resulting in an average computing speed of 36.4 Gflops. Here
we use the operation count of 38 per interaction.
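These figures can be cross-checked against each other. The sketch below takes the interaction count as 2.90 x 10^13 (the exponent consistent with the quoted list length and speed) and recomputes the average list length and the raw speed. Because the inputs are rounded to three digits, the results agree with the quoted values only to within about half a percent.

```python
n_particles = 2_159_038     # particles in the simulation
n_steps = 999               # timesteps
interactions = 2.90e13      # total particle-particle interactions (rounded)
ops_per_interaction = 38    # operation count per interaction
wall_seconds = 30_141       # wall-clock time including I/O

avg_list_length = interactions / (n_particles * n_steps)
raw_gflops = interactions * ops_per_interaction / wall_seconds / 1e9
wall_hours = wall_seconds / 3600
print(avg_list_length, raw_gflops, wall_hours)
```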

However, as we described in section 3, our modified tree algorithm performs a larger
number of operations than the tree algorithm on a general-purpose computer. In order to
correct for this, we estimated the operation count of the original tree algorithm for the
same simulation, using five snapshot files and the same accuracy parameter. The estimated
number of interactions is 4.69 x 10^12. The effective sustained speed is therefore 5.92 Gflops and
the price/performance is $7.0/Mflops.
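The effective speed and the price/performance follow from the corrected interaction count in the same way. The check below takes the corrected count as 4.69 x 10^12 and the total cost as 40,900 dollars. With these rounded inputs the speed comes out at about 5.91 Gflops and the price at roughly 6.9 dollars per Mflops, close to the quoted 5.92 Gflops and $7.0/Mflops.

```python
effective_interactions = 4.69e12   # estimated count for the original tree algorithm
ops_per_interaction = 38           # operation count per interaction
wall_seconds = 30_141              # wall-clock time including I/O
total_cost_dollars = 40_900        # total system cost

effective_gflops = effective_interactions * ops_per_interaction / wall_seconds / 1e9
dollars_per_mflops = total_cost_dollars / (effective_gflops * 1e3)
print(effective_gflops, dollars_per_mflops)
```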

Figure 4: A snapshot of the simulation at z = 0 (present time). Particles in a 45Mpc x
45Mpc x 2.5Mpc box are plotted.